Data Description¶
This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.
We will examine car accidents across 49 states of the USA for the years 2020–2022. The original dataset is available at https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data.
Importing Required Libraries¶
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
from datetime import datetime
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder,LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
warnings.filterwarnings('ignore')
Data Handling and Editing¶
carAccident = pd.read_csv("US_Accidents_March23.csv")
carAccident.head()
| ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | Source2 | 3 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.865147 | -84.058723 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 1 | A-2 | Source2 | 2 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.928059 | -82.831184 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Day |
| 2 | A-3 | Source2 | 2 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.063148 | -84.032608 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Night | Night | Day | Day |
| 3 | A-4 | Source2 | 3 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.747753 | -84.205582 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Day | Day | Day |
| 4 | A-5 | Source2 | 2 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.627781 | -84.188354 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Day | Day | Day | Day |
5 rows × 46 columns
carAccident['Year'] = carAccident['Start_Time'].apply(lambda x: x[:4])
carAccident['Month'] = carAccident['Start_Time'].apply(lambda x: x[5:7])
carAccident['Start_Hour'] = carAccident['Start_Time'].apply(lambda x: x[11:13])
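The `apply` calls above run a Python-level loop over 7.7 million rows; the vectorized `.str` accessor produces the same slices much faster. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows standing in for carAccident['Start_Time']
toy = pd.DataFrame({'Start_Time': ['2021-03-05 07:15:00', '2022-11-20 18:40:00']})

# Vectorized slicing via the .str accessor: same result as the .apply lambdas above
toy['Year'] = toy['Start_Time'].str[:4]
toy['Month'] = toy['Start_Time'].str[5:7]
toy['Start_Hour'] = toy['Start_Time'].str[11:13]
print(toy[['Year', 'Month', 'Start_Hour']].values.tolist())  # [['2021', '03', '07'], ['2022', '11', '18']]
```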
carAccident.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7728394 entries, 0 to 7728393 Data columns (total 49 columns): # Column Dtype --- ------ ----- 0 ID object 1 Source object 2 Severity int64 3 Start_Time object 4 End_Time object 5 Start_Lat float64 6 Start_Lng float64 7 End_Lat float64 8 End_Lng float64 9 Distance(mi) float64 10 Description object 11 Street object 12 City object 13 County object 14 State object 15 Zipcode object 16 Country object 17 Timezone object 18 Airport_Code object 19 Weather_Timestamp object 20 Temperature(F) float64 21 Wind_Chill(F) float64 22 Humidity(%) float64 23 Pressure(in) float64 24 Visibility(mi) float64 25 Wind_Direction object 26 Wind_Speed(mph) float64 27 Precipitation(in) float64 28 Weather_Condition object 29 Amenity bool 30 Bump bool 31 Crossing bool 32 Give_Way bool 33 Junction bool 34 No_Exit bool 35 Railway bool 36 Roundabout bool 37 Station bool 38 Stop bool 39 Traffic_Calming bool 40 Traffic_Signal bool 41 Turning_Loop bool 42 Sunrise_Sunset object 43 Civil_Twilight object 44 Nautical_Twilight object 45 Astronomical_Twilight object 46 Year object 47 Month object 48 Start_Hour object dtypes: bool(13), float64(12), int64(1), object(23) memory usage: 2.2+ GB
Creating the dataset for the study¶
df = carAccident[carAccident['Year'].isin(['2020', '2021', '2022'])].copy()
# fix datetime type
df['Start_Time'] = pd.to_datetime(df['Start_Time'].str[:19])
df['End_Time'] = pd.to_datetime(df['End_Time'].str[:19])
df['Weather_Timestamp'] = pd.to_datetime(df['Weather_Timestamp'].str[:19])
df['Year'] = df['Start_Time'].dt.year
df['Month'] = df['Start_Time'].dt.month
df['Hour'] = df['Start_Time'].dt.hour
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 4505118 entries, 512217 to 7246341 Data columns (total 50 columns): # Column Dtype --- ------ ----- 0 ID object 1 Source object 2 Severity int64 3 Start_Time datetime64[ns] 4 End_Time datetime64[ns] 5 Start_Lat float64 6 Start_Lng float64 7 End_Lat float64 8 End_Lng float64 9 Distance(mi) float64 10 Description object 11 Street object 12 City object 13 County object 14 State object 15 Zipcode object 16 Country object 17 Timezone object 18 Airport_Code object 19 Weather_Timestamp datetime64[ns] 20 Temperature(F) float64 21 Wind_Chill(F) float64 22 Humidity(%) float64 23 Pressure(in) float64 24 Visibility(mi) float64 25 Wind_Direction object 26 Wind_Speed(mph) float64 27 Precipitation(in) float64 28 Weather_Condition object 29 Amenity bool 30 Bump bool 31 Crossing bool 32 Give_Way bool 33 Junction bool 34 No_Exit bool 35 Railway bool 36 Roundabout bool 37 Station bool 38 Stop bool 39 Traffic_Calming bool 40 Traffic_Signal bool 41 Turning_Loop bool 42 Sunrise_Sunset object 43 Civil_Twilight object 44 Nautical_Twilight object 45 Astronomical_Twilight object 46 Year int32 47 Month int32 48 Start_Hour object 49 Hour int32 dtypes: bool(13), datetime64[ns](3), float64(12), int32(3), int64(1), object(18) memory usage: 1.3+ GB
Calculate duration as the difference between end time and start time in hours¶
df['Duration'] = df.End_Time - df.Start_Time
df['Duration'] = df['Duration'].apply(lambda x: round(x.total_seconds() / 3600))  # 3600 seconds per hour
print("The overall mean duration is: ", (round(df['Duration'].mean(),3)), 'hours')
The overall mean duration is: 10.838 hours
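Since subtracting two datetime columns yields a timedelta Series, the per-row `apply` can be replaced with the vectorized `.dt.total_seconds()` accessor. A sketch with hypothetical timestamps:

```python
import pandas as pd

# Hypothetical timestamps standing in for Start_Time / End_Time
times = pd.DataFrame({
    'Start_Time': pd.to_datetime(['2021-01-01 08:00:00', '2021-01-01 09:00:00']),
    'End_Time':   pd.to_datetime(['2021-01-01 10:00:00', '2021-01-01 10:12:00']),
})
# Datetime subtraction gives a timedelta Series; dividing total seconds by 3600 gives hours
times['Duration'] = ((times['End_Time'] - times['Start_Time']).dt.total_seconds() / 3600).round()
print(times['Duration'].tolist())  # [2.0, 1.0]
```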
Wind Direction & Weather Bins¶
# Normalize wind direction labels into compass categories
df.loc[df['Wind_Direction']=='Calm','Wind_Direction'] = 'CALM'
df.loc[(df['Wind_Direction']=='West')|(df['Wind_Direction']=='WSW')|(df['Wind_Direction']=='WNW'),'Wind_Direction'] = 'W'
df.loc[(df['Wind_Direction']=='South')|(df['Wind_Direction']=='SSW')|(df['Wind_Direction']=='SSE'),'Wind_Direction'] = 'S'
df.loc[(df['Wind_Direction']=='North')|(df['Wind_Direction']=='NNW')|(df['Wind_Direction']=='NNE'),'Wind_Direction'] = 'N'
df.loc[(df['Wind_Direction']=='East')|(df['Wind_Direction']=='ESE')|(df['Wind_Direction']=='ENE'),'Wind_Direction'] = 'E'
df.loc[df['Wind_Direction']=='Variable','Wind_Direction'] = 'VAR'
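The six chained `.loc` assignments can be collapsed into one `Series.replace` call with a mapping dict; values already in short form (e.g. 'E') pass through unchanged. A sketch:

```python
import pandas as pd

# One mapping dict replaces the six chained .loc assignments above
wind_map = {
    'Calm': 'CALM', 'Variable': 'VAR',
    'West': 'W', 'WSW': 'W', 'WNW': 'W',
    'South': 'S', 'SSW': 'S', 'SSE': 'S',
    'North': 'N', 'NNW': 'N', 'NNE': 'N',
    'East': 'E', 'ESE': 'E', 'ENE': 'E',
}
s = pd.Series(['Calm', 'WSW', 'NNE', 'E'])  # 'E' is already in short form and passes through
print(s.replace(wind_map).tolist())  # ['CALM', 'W', 'N', 'E']
```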
weather_bins = {
'Clear': ['Clear', 'Fair'],
'Cloudy': ['Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds'],
'Rainy': ['Light Rain', 'Rain', 'Light Freezing Drizzle', 'Light Drizzle', 'Light Freezing Rain',
'Drizzle', 'Light Freezing Fog', 'Light Rain Showers', 'Showers in the Vicinity', 'T-Storm', 'Thunder',
'Patches of Fog', 'Funnel Cloud', 'Rain / Windy', 'Squalls', 'Thunder / Windy', 'Drizzle and Fog',
'T-Storm / Windy', 'Smoke / Windy', 'Haze / Windy', 'Light Drizzle / Windy', 'Widespread Dust / Windy',
'Wintry Mix', 'Wintry Mix / Windy', 'Light Snow with Thunder', 'Fog / Windy', 'Sleet / Windy',
'Squalls / Windy', 'Light Rain Shower / Windy', 'Light Sleet / Windy', 'Sand / Dust Whirlwinds',
'Mist / Windy', 'Drizzle / Windy', 'Duststorm', 'Sand / Dust Whirls Nearby', 'Thunder and Hail',
'Freezing Rain / Windy', 'Partial Fog', 'Thunder / Wintry Mix / Windy', 'Patches of Fog / Windy',
'Rain and Sleet', 'Partial Fog / Windy', 'Sand / Dust Whirlwinds / Windy', 'Light Hail', 'Light Thunderstorm',
'Rain Shower / Windy', 'Sleet and Thunder', 'Drifting Snow / Windy', 'Shallow Fog / Windy',
'Thunder and Hail / Windy', 'Heavy Sleet / Windy', 'Sand / Windy', 'Blowing Sand', 'Drifting Snow'],
'Heavy_Rainy': ['Heavy Rain', 'Heavy T-Storm', 'Heavy Thunderstorms and Rain', 'Heavy T-Storm / Windy',
'Heavy Rain / Windy', 'Heavy Ice Pellets', 'Heavy Freezing Rain / Windy', 'Heavy Freezing Drizzle',
'Heavy Rain Showers', 'Heavy Sleet and Thunder', 'Heavy Rain Shower / Windy','Heavy Rain Shower',
'Heavy Thunderstorms with Small Hail'],
'Snowy': ['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy',
'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy',
'Heavy Thunderstorms and Snow', 'Snow and Thunder / Windy', 'Snow and Thunder', 'Light Snow Shower / Windy',
'Light Snow Grains', 'Heavy Snow with Thunder', 'Heavy Blowing Snow', 'Low Drifting Snow',
'Thunderstorms and Snow', 'Blowing Snow Nearby', 'Light Blowing Snow'],
'Windy': ['Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy',
'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy',
'Blowing Snow Nearby', 'Heavy Rain Shower / Windy'],
'Hail': ['Hail'],
'Volcanic Ash': ['Volcanic Ash'],
'Tornado': ['Tornado']
}
def map_weather_to_bins(weather):
for bin_name, bin_values in weather_bins.items():
if weather in bin_values:
return bin_name
return 'Other'
df['Weather_Bin'] = df['Weather_Condition'].apply(map_weather_to_bins)
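`map_weather_to_bins` scans every bin list for each of the 4.5 million rows; inverting `weather_bins` once into a condition-to-bin dict makes each lookup a single hash access. A sketch with an abbreviated, hypothetical bin dictionary:

```python
# Abbreviated, hypothetical stand-in for the full weather_bins dictionary above
weather_bins_small = {'Clear': ['Clear', 'Fair'], 'Snowy': ['Snow', 'Light Snow']}

# Invert once: condition -> bin, so each row lookup is a single dict access
weather_to_bin = {cond: bin_name
                  for bin_name, conds in weather_bins_small.items()
                  for cond in conds}

def map_weather_to_bins_fast(weather):
    return weather_to_bin.get(weather, 'Other')

print(map_weather_to_bins_fast('Fair'), map_weather_to_bins_fast('Tornado'))  # Clear Other
```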
df.head()
| ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | Year | Month | Start_Hour | Hour | Duration | Weather_Bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 512217 | A-512230 | Source2 | 1 | 2022-09-08 05:49:30 | 2022-09-08 06:34:53 | 41.946796 | -88.208092 | NaN | NaN | 0.00 | ... | Night | Night | Day | Day | 2022 | 9 | 05 | 5 | 1 | Clear |
| 512218 | A-512231 | Source2 | 1 | 2022-09-08 02:02:05 | 2022-09-08 04:31:32 | 34.521172 | -117.958076 | NaN | NaN | 0.00 | ... | Night | Night | Night | Night | 2022 | 9 | 02 | 2 | 2 | Clear |
| 512219 | A-512232 | Source2 | 1 | 2022-09-08 05:14:12 | 2022-09-08 07:38:17 | 37.542839 | -77.441780 | NaN | NaN | 0.00 | ... | Night | Night | Night | Night | 2022 | 9 | 05 | 5 | 2 | Cloudy |
| 512220 | A-512233 | Source2 | 1 | 2022-09-08 06:22:57 | 2022-09-08 06:52:42 | 40.896629 | -81.178452 | NaN | NaN | 0.00 | ... | Night | Night | Day | Day | 2022 | 9 | 06 | 6 | 0 | Cloudy |
| 512221 | A-512234 | Source2 | 2 | 2022-09-08 06:36:20 | 2022-09-08 07:05:58 | 41.409359 | -81.644318 | NaN | NaN | 1.91 | ... | Night | Day | Day | Day | 2022 | 9 | 06 | 6 | 0 | Cloudy |
5 rows × 52 columns
Exploratory Data Analysis¶
Statistical Description of each numerical column¶
df.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| Severity | 4505118.0 | 2.123295 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 | 0.429284 |
| Start_Time | 4505118 | 2021-08-25 18:45:36.072808448 | 2020-01-01 00:01:00 | 2020-12-23 11:14:00 | 2021-09-27 12:46:04.500000 | 2022-04-30 21:56:14.249999872 | 2022-12-31 23:59:03 | NaN |
| End_Time | 4505118 | 2021-08-26 05:51:08.807245056 | 2020-01-01 00:34:29 | 2020-12-23 19:02:43.500000 | 2021-09-28 01:13:51 | 2022-05-01 16:09:32.249999872 | 2023-03-31 23:59:00 | NaN |
| Start_Lat | 4505118.0 | 35.982806 | 24.5548 | 33.027574 | 35.782829 | 39.962899 | 49.000504 | 5.177597 |
| Start_Lng | 4505118.0 | -94.077159 | -124.548074 | -117.137228 | -86.685037 | -80.216664 | -67.48413 | 17.430231 |
| End_Lat | 3348607.0 | 35.945393 | 24.566013 | 32.935342 | 35.87789 | 39.969813 | 49.002223 | 5.312601 |
| End_Lng | 3348607.0 | -94.678953 | -124.545748 | -117.409456 | -86.703416 | -80.173215 | -67.48413 | 17.891797 |
| Distance(mi) | 4505118.0 | 0.724965 | 0.0 | 0.0 | 0.135 | 0.749 | 441.75 | 1.870308 |
| Weather_Timestamp | 4426814 | 2021-08-25 21:05:40.420118528 | 2020-01-01 00:12:00 | 2020-12-23 18:53:00 | 2021-09-27 19:48:30 | 2022-04-30 19:52:00 | 2022-12-31 23:56:00 | NaN |
| Temperature(F) | 4403414.0 | 61.959102 | -89.0 | 50.0 | 64.0 | 77.0 | 203.0 | 19.067869 |
| Wind_Chill(F) | 4368280.0 | 60.740209 | -89.0 | 50.0 | 64.0 | 76.0 | 196.0 | 21.182693 |
| Humidity(%) | 4396807.0 | 64.248426 | 1.0 | 48.0 | 66.0 | 84.0 | 100.0 | 22.979141 |
| Pressure(in) | 4418158.0 | 29.363386 | 0.0 | 29.18 | 29.7 | 29.96 | 58.63 | 1.095945 |
| Visibility(mi) | 4400499.0 | 9.085471 | 0.0 | 10.0 | 10.0 | 10.0 | 140.0 | 2.517412 |
| Wind_Speed(mph) | 4382734.0 | 7.319508 | 0.0 | 3.0 | 7.0 | 10.0 | 1087.0 | 5.530296 |
| Precipitation(in) | 4312635.0 | 0.005723 | 0.0 | 0.0 | 0.0 | 0.0 | 36.47 | 0.052758 |
| Year | 4505118.0 | 2021.129528 | 2020.0 | 2020.0 | 2021.0 | 2022.0 | 2022.0 | 0.797569 |
| Month | 4505118.0 | 6.755552 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 | 3.640925 |
| Hour | 4505118.0 | 12.4661 | 0.0 | 8.0 | 13.0 | 17.0 | 23.0 | 5.683172 |
| Duration | 4505118.0 | 10.838266 | 0.0 | 1.0 | 1.0 | 2.0 | 25889.0 | 277.696796 |
Statistical Description of each categorical column¶
df.select_dtypes(include = ['object','bool']).describe().T
| count | unique | top | freq | |
|---|---|---|---|---|
| ID | 4505118 | 4505118 | A-512230 | 1 |
| Source | 4505118 | 3 | Source1 | 3348607 |
| Description | 4505114 | 2266825 | A crash has occurred causing no to minimum del... | 9593 |
| Street | 4495025 | 262146 | I-95 S | 47576 |
| City | 4504950 | 12339 | Miami | 150492 |
| County | 4505118 | 1776 | Los Angeles | 279370 |
| State | 4505118 | 49 | CA | 1003321 |
| Zipcode | 4504077 | 570034 | 33186 | 7270 |
| Country | 4505118 | 1 | US | 4505118 |
| Timezone | 4500637 | 4 | US/Eastern | 2197472 |
| Airport_Code | 4488918 | 1988 | KCQT | 64788 |
| Wind_Direction | 4382691 | 10 | CALM | 784869 |
| Weather_Condition | 4404143 | 109 | Fair | 2128073 |
| Amenity | 4505118 | 2 | False | 4452904 |
| Bump | 4505118 | 2 | False | 4502759 |
| Crossing | 4505118 | 2 | False | 4037038 |
| Give_Way | 4505118 | 2 | False | 4486759 |
| Junction | 4505118 | 2 | False | 4201423 |
| No_Exit | 4505118 | 2 | False | 4493668 |
| Railway | 4505118 | 2 | False | 4468606 |
| Roundabout | 4505118 | 2 | False | 4504983 |
| Station | 4505118 | 2 | False | 4385179 |
| Stop | 4505118 | 2 | False | 4381382 |
| Traffic_Calming | 4505118 | 2 | False | 4500392 |
| Traffic_Signal | 4505118 | 2 | False | 3972257 |
| Turning_Loop | 4505118 | 1 | False | 4505118 |
| Sunrise_Sunset | 4483654 | 2 | Day | 2991746 |
| Civil_Twilight | 4483654 | 2 | Day | 3190208 |
| Nautical_Twilight | 4483654 | 2 | Day | 3406089 |
| Astronomical_Twilight | 4483654 | 2 | Day | 3583656 |
| Start_Hour | 4505118 | 24 | 16 | 351260 |
| Weather_Bin | 4505118 | 9 | Clear | 2128073 |
The 20 counties with the highest number of accidents¶
counties_by_accident = df.County.value_counts()
top20_counties = counties_by_accident.head(20)
sns.barplot(y=top20_counties.index, x=top20_counties.values)
plt.title('The 20 counties with the highest number of accidents')
plt.tight_layout()
#Los Angeles County has the highest number of accidents by a significant margin, indicating it is a major hotspot for accidents.
Counts of accidents by year¶
fig, ax = plt.subplots(figsize = (7.5,5))
c = sns.countplot(x="Year", data=df, orient = 'v', palette = "crest_r")
c.set_title("Counts of Accidents by Year")
for i in ax.patches:
count = '{:,.0f}'.format(i.get_height())
x = i.get_x()+i.get_width()-0.60
y = i.get_height()+10000
ax.annotate(count, (x, y))
plt.show()
#2020: The lower number of accidents in 2020 could be attributed to the COVID-19 pandemic, where lockdowns and reduced travel may have resulted in fewer accidents.
#2021 and 2022: The increase in accidents in these years could be due to the gradual return to normalcy, with more vehicles on the road and higher traffic volumes as restrictions were lifted.
Average Duration Of Accidents by Severity¶
avg_time = df.groupby('Severity')['Duration'].mean()
# Plot the results
plt.figure(figsize=(8, 5))
avg_time.plot(kind='bar', color='skyblue')
plt.xlabel('Severity')
plt.ylabel('Average Duration (hours)')
plt.title('Average Duration of Accidents by Severity')
plt.xticks(rotation=0)
for index, value in enumerate(avg_time):
plt.text(index, value, f'{value:.2f}', ha='center', va='bottom')
plt.show()
#Severity 1: The average duration of accidents with severity 1 is approximately 0.79 hours (around 47 minutes). These are likely minor accidents with quick resolution times.
#Severity 2: Accidents with severity 2 have an average duration of 11.27 hours. This significant increase suggests that severity 2 accidents are more serious and require more time for resolution, possibly involving injuries or more substantial vehicle damage.
#Severity 3: The average duration for severity 3 accidents is around 0.96 hours (approximately 58 minutes), indicating they are slightly more severe than level 1 but still resolved relatively quickly.
#Severity 4: The most severe accidents (severity 4) have an average duration of 39.58 hours, which is a substantial duration. These accidents are likely very serious, involving significant road blockages, severe injuries, or fatalities, requiring extensive time for investigation and cleanup.
The number of accidents by state codes.¶
fig, ax = plt.subplots(figsize = (15,5))
c = sns.countplot(x="State", data=df, palette = "crest_r", order = df['State'].value_counts().index)
c.set_title("States with No. of Accidents");
#California (CA) has the highest number of accidents, approaching 1 million.
#Florida (FL) follows with a substantial number of accidents, significantly lower than California but still notably high.
#The distribution of accidents across states varies widely, with some states showing very high numbers while others have relatively fewer accidents.
Top 50 Cities with Highest No. of Accidents¶
fig, ax = plt.subplots(figsize = (15,5))
c = sns.countplot(x="City", data=df, order=df.City.value_counts().iloc[:50].index, orient = 'v', palette = "crest_r")
c.set_title("Top 50 Cities with Highest No. of Accidents")
c.set_xticklabels(c.get_xticklabels(), rotation=90)
plt.show()
#The cities with the highest number of accidents are major metropolitan areas, such as Miami, Los Angeles, and Orlando. This makes sense due to their large populations and heavy traffic.
Accident cases under different weather conditions in the US¶
plt.figure(figsize=(10,5))
sns.barplot(x=df['Weather_Bin'].value_counts().iloc[:50], y=df['Weather_Bin'].value_counts().iloc[:50].index)
plt.title("Accident cases under different weather conditions in the US", size=17, color="grey")
plt.xlabel('No. of accidents')
plt.ylabel('Weather condition')
plt.show()
#Clear weather conditions account for the highest number of accidents, with over 2 million cases. This indicates that most accidents occur under clear weather conditions.
#Cloudy weather is the second most common condition associated with accidents, with around 1.5 million cases.
The time period with the most accidents¶
fig, ax = plt.subplots(figsize = (10,5))
sns.countplot(x="Hour", data=df, orient = 'v', palette = "icefire_r")
plt.annotate('Morning Peak',xy=(6,350000), fontsize=12)
plt.annotate('Evening Peak',xy=(15,350000), fontsize=12)
plt.annotate('go to work',xy=(8,0),xytext=(0,95000),arrowprops={'arrowstyle':'-|>'}, fontsize=12)
plt.annotate('get off work',xy=(17,0),xytext=(19,95000),arrowprops={'arrowstyle':'-|>'}, fontsize=12)
plt.title('The time period with the most accidents')
plt.show()
#The number of accidents increases significantly starting from around 5 AM, peaking at 8 AM.
#This peak corresponds to the morning rush hour when many people are commuting to work or school. Increased traffic volume during this time likely contributes to the higher number of accidents.
data = df.copy()  # work on a copy so df['Month'] keeps its integer form
data['Month'] = data['Start_Time'].dt.to_period('M')
monthly_accidents = data.groupby('Month').size().reset_index(name='Accidents')
monthly_accidents['Month'] = monthly_accidents['Month'].dt.to_timestamp()
fig = px.line(monthly_accidents, x='Month', y='Accidents', title='Monthly Number of Accidents',
labels={'Month': 'Month', 'Accidents': 'Number of Accidents'},
template='plotly_dark')
fig.update_traces(line_color='cyan', line_width=2)
fig.update_layout(title_font_size=24, title_x=0.5)
fig.show()
#There appear to be periodic peaks and troughs, suggesting possible seasonal effects on the number of accidents.
#For instance, certain months may see higher accident rates due to adverse weather conditions, holidays, or other events that increase traffic volume.
Machine Learning & Prediction¶
Data Cleaning¶
#The 'ID' feature carries no information about the accidents themselves. 'Distance(mi)', 'End_Time' (we have start time),
#'Duration', 'End_Lat', and 'End_Lng' (we have start location) are only known after an accident has happened, so they cannot serve as predictors.
df = df.drop(['ID','Start_Time', 'Start_Lat','Start_Lng','Description','Distance(mi)', 'End_Time', 'Duration',
'End_Lat', 'End_Lng','Weather_Timestamp'], axis=1)
#categorical columns
cat_names = [ 'Country', 'Timezone', 'Amenity', 'Bump', 'Crossing',
'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop', 'Sunrise_Sunset',
'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight']
print("Unique count of categorical features:")
for i in cat_names:
print(i,df[i].unique().size)
Unique count of categorical features: Country 1 Timezone 5 Amenity 2 Bump 2 Crossing 2 Give_Way 2 Junction 2 No_Exit 2 Railway 2 Roundabout 2 Station 2 Stop 2 Traffic_Calming 2 Traffic_Signal 2 Turning_Loop 1 Sunrise_Sunset 3 Civil_Twilight 3 Nautical_Twilight 3 Astronomical_Twilight 3
#Drop 'Country' and 'Turning_Loop' for they have only one class.
df = df.drop(['Country','Turning_Loop'], axis=1)
Correlations¶
num_corr = df.select_dtypes(include = ['float64','int64']).corr()
sns.heatmap(num_corr)
<Axes: >
#The 'Severity' of incidents shows some positive correlation with 'Precipitation(in)' and 'Temperature(F)', although these correlations are not very strong.
#There appears to be a weak negative correlation with 'Visibility(mi)', suggesting that lower visibility might be associated with higher severity, but the effect is not very pronounced.
Calculating Cramér's V statistic for categorical columns¶
import scipy.stats as stats
df['Severity_Label'] = df['Severity'].apply(lambda x: 'Very High' if x == 4 else ('High' if x == 3 else ('Medium' if x == 2 else 'Low')))
categorical_features = ['Severity_Label','Street','Weather_Bin','State']
def cramers_v(x, y):
confusion_matrix = pd.crosstab(x, y)
chi2 = stats.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum().sum()
phi2 = chi2/n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
results = pd.DataFrame(index=categorical_features, columns=categorical_features)
for i in range(len(categorical_features)):
for j in range(len(categorical_features)):
if i == j:
results.iloc[i, j] = np.nan # Diagonal
else:
results.iloc[i, j] = cramers_v(df[categorical_features[i]], df[categorical_features[j]])
results = results.astype(float)
print("Cramér's V Matrix:")
print(results)
plt.figure(figsize=(10, 8))
sns.heatmap(results, annot=True, cmap='viridis', cbar=True)
plt.title("Cramér's V Heatmap for Categorical Variables")
plt.show()
Cramér's V Matrix:
Severity_Label Street Weather_Bin State
Severity_Label NaN 0.350105 0.033211 0.173223
Street 0.350105 NaN 0.232840 0.748451
Weather_Bin 0.033211 0.232840 NaN 0.124383
State 0.173223 0.748451 0.124383 NaN
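As a sanity check on the bias-corrected statistic, `cramers_v` should return close to 1 for perfectly associated variables and close to 0 for independent ones; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def cramers_v(x, y):
    # Bias-corrected Cramér's V, same formula as in the analysis above
    confusion_matrix = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

rng = np.random.default_rng(0)
a = pd.Series(rng.integers(0, 3, 10000))
b = a.map({0: 'x', 1: 'y', 2: 'z'})       # a fully determines b -> V near 1
c = pd.Series(rng.integers(0, 3, 10000))  # independent of a -> V near 0
v_dep, v_ind = cramers_v(a, b), cramers_v(a, c)
print(round(v_dep, 3), round(v_ind, 3))
```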
#There is a moderate association between the severity label of accidents and the street where the accidents occurred.
#This indicates that certain streets might have a higher tendency to experience accidents of particular severities.
Models & Predict¶
Handling Missing Data¶
print(df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']].isnull().sum())
missing = pd.DataFrame(df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']].isnull().sum()).reset_index()
missing.columns = ['Feature', 'Missing_Percent(%)']
missing['Missing_Percent(%)'] = missing['Missing_Percent(%)'].apply(lambda x: x / df.shape[0] * 100)
missing.loc[missing['Missing_Percent(%)']>0,:].sort_values(by = 'Missing_Percent(%)')
Street 10093 State 0 City 168 Weather_Bin 0 Hour 0 Severity 0 dtype: int64
| Feature | Missing_Percent(%) | |
|---|---|---|
| 2 | City | 0.003729 |
| 0 | Street | 0.224034 |
#Imputing City and Street columns with KNN Imputer Method
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
label_encoders = {}
for column in ['City', 'Street']:
le = LabelEncoder()
df[column] = le.fit_transform(df[column].astype(str)) # Convert to string to handle NaNs
label_encoders[column] = le
# Apply KNNImputer to the relevant columns
imputer = KNNImputer(n_neighbors=5)
df[['City', 'Street']] = imputer.fit_transform(df[['City', 'Street']])
# Inverse transform the encoded columns back to original categorical format
for column in ['City', 'Street']:
df[column] = label_encoders[column].inverse_transform(df[column].round().astype(int))
df[['City','Street']].isnull().sum()
City 0 Street 0 dtype: int64
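Note that KNN imputation over label-encoded categories treats arbitrary integer codes as distances, which has no real meaning for city or street names. A simpler alternative (a sketch, not what the analysis above used) is mode imputation, filling each column with its most frequent category:

```python
import pandas as pd

# Hypothetical frame with missing categorical values
toy = pd.DataFrame({'City': ['Miami', None, 'Miami', 'Orlando'],
                    'Street': ['I-95 S', 'I-4 W', None, 'I-95 S']})

# Fill each column's missing entries with its most frequent category
for col in ['City', 'Street']:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
print(toy['City'].tolist())  # ['Miami', 'Miami', 'Miami', 'Orlando']
```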
df = df.sample(n=500000, random_state=42)
#Due to hardware limitations, model validation techniques such as cross-validation could not be performed. Therefore, models were evaluated directly using the training and test sets.
# Select the relevant columns
model_df = df[['Street', 'State', 'City', 'Weather_Bin', 'Hour', 'Severity']]
X = model_df.drop('Severity', axis=1)
y = model_df['Severity']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.shape
(350000, 5)
X_test.shape
(150000, 5)
#Encoding
#Because some variables have hundreds of categories, label encoding is used instead of one-hot encoding.
#Each encoder is fit once on the full feature column so train and test share the same integer codes;
#fitting separate encoders on train and test would assign inconsistent labels to the same category.
for column in ['Street', 'State', 'City', 'Weather_Bin']:
    le = LabelEncoder()
    le.fit(X[column])
    X_train[column] = le.transform(X_train[column])
    X_test[column] = le.transform(X_test[column])
    label_encoders[column] = le
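A variant that keeps the encoding fit strictly to the training data is scikit-learn's `OrdinalEncoder`, which can map categories unseen at training time to a sentinel value. A sketch with hypothetical city names:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'City': ['Miami', 'Orlando', 'Miami']})
test = pd.DataFrame({'City': ['Orlando', 'Tampa']})  # 'Tampa' never seen in training

# Fit on training data only; unknown test categories map to the sentinel -1
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train)
test_codes = enc.transform(test)
print(test_codes.ravel().tolist())  # [1.0, -1.0]
```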
X_test
| Street | State | City | Weather_Bin | Hour | |
|---|---|---|---|---|---|
| 5453590 | 24831 | 8 | 5243 | 1 | 17 |
| 7022755 | 23024 | 6 | 1043 | 0 | 13 |
| 4041934 | 30251 | 38 | 2155 | 0 | 18 |
| 5641001 | 36858 | 3 | 2012 | 0 | 15 |
| 4802439 | 16128 | 29 | 4916 | 0 | 12 |
| ... | ... | ... | ... | ... | ... |
| 6430936 | 16577 | 38 | 1001 | 0 | 13 |
| 6824164 | 277 | 16 | 5713 | 0 | 22 |
| 1045717 | 29597 | 32 | 1548 | 0 | 1 |
| 4589986 | 42146 | 25 | 2182 | 0 | 16 |
| 867607 | 26627 | 41 | 7022 | 1 | 16 |
150000 rows × 5 columns
X_train
| Street | State | City | Weather_Bin | Hour | |
|---|---|---|---|---|---|
| 5895872 | 42118 | 8 | 6381 | 0 | 15 |
| 5669769 | 61759 | 32 | 7766 | 0 | 13 |
| 6549343 | 326 | 40 | 3042 | 1 | 21 |
| 4748814 | 27839 | 43 | 8194 | 0 | 15 |
| 6918065 | 30669 | 22 | 5183 | 1 | 15 |
| ... | ... | ... | ... | ... | ... |
| 4324204 | 56910 | 3 | 5839 | 0 | 13 |
| 594911 | 19609 | 18 | 6312 | 1 | 8 |
| 5149718 | 48787 | 41 | 4337 | 1 | 17 |
| 3981790 | 21639 | 3 | 7115 | 0 | 14 |
| 1531252 | 12454 | 21 | 6943 | 1 | 17 |
350000 rows × 5 columns
#Logistic Regression
log_model = LogisticRegression(max_iter=1000).fit(X_train,y_train)
y_pred = log_model.predict(X_test)
logistic_accuracy = accuracy_score(y_test, y_pred)
logistic_accuracy
0.8720666666666667
#Decision Tree Classifier
Dcf_model = DecisionTreeClassifier().fit(X_train,y_train)
y_pred = Dcf_model.predict(X_test)
Dcf_accuracy = accuracy_score(y_test, y_pred)
Dcf_accuracy
0.7687666666666667
# Random Forest Classifier
Rf_model = RandomForestClassifier().fit(X_train,y_train)
y_pred = Rf_model.predict(X_test)
Rf_accuracy = accuracy_score(y_test, y_pred)
Rf_accuracy
0.8672733333333333
models = {'Logistic Regression Classifier':LogisticRegression(),
'Decision Tree Classifier' :DecisionTreeClassifier() ,
'Random Forest Classifier':RandomForestClassifier()}
results = []
for model_name, model in models.items():
# Fit the model
model.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Append the result
results.append((model_name, accuracy))
# Convert results to DataFrame and display
results_df = pd.DataFrame(results, columns=['Model', 'Accuracy'])
# Find the best model
best_model_name = results_df.loc[results_df['Accuracy'].idxmax(), 'Model']
best_model_score = results_df['Accuracy'].max()
print(f'The best model is: {best_model_name} with an accuracy of {best_model_score}')
# Plot the results
plt.figure(figsize=(10, 6))
plt.bar(results_df['Model'], results_df['Accuracy'], color='skyblue')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.ylim(0, 1)
plt.show()
The best model is: Logistic Regression Classifier with an accuracy of 0.8720666666666667
#Accuracy: The Logistic Regression Classifier achieved the highest accuracy (approximately 87.2%).
#Logistic Regression is performing better than both Decision Tree and Random Forest Classifiers.
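That 87.2% should be read against the class balance: Severity 2 dominates the data (its 25th, 50th, and 75th percentiles in the describe table are all 2), so a model that always predicts the majority class scores about the same. A sketch of the baseline check, using a hypothetical distribution mirroring the imbalance:

```python
import pandas as pd

# Hypothetical severity distribution mirroring the dataset's imbalance
y_test = pd.Series([2] * 87 + [3] * 8 + [4] * 3 + [1] * 2)

# Accuracy of a trivial model that always predicts the majority class
baseline = y_test.value_counts(normalize=True).max()
print(round(baseline, 2))  # 0.87
```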
Cluster Analysis¶
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
model_df.reset_index(drop=True,inplace = True)
kdf = model_df.copy()
for column in ['Street', 'State', 'City', 'Weather_Bin']:
lekmeans = LabelEncoder()
kdf[column] = lekmeans.fit_transform(kdf[column])
label_encoders[column] = lekmeans
kmeans = KMeans()
visu = KElbowVisualizer(kmeans, k = (2,20))
visu.fit(kdf)
visu.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
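The elbow method picks k from the distortion curve alone; the silhouette score is a complementary check (run on a subsample or synthetic data, since computing it on 500,000 rows is slow). A sketch on synthetic well-separated clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the encoded feature matrix: three well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
score = silhouette_score(X, km.labels_)  # near 1 for compact, well-separated clusters
print(score > 0.5)
```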
# Fit KMeans with the cluster count chosen from the elbow plot
kmeans = KMeans(n_clusters = 5).fit(kdf)
kmeans
KMeans(n_clusters=5)
clusters = kmeans.labels_
Severity_clusters = pd.DataFrame({"Severity": kdf.Severity, "Cluster": clusters})
State_clusters = pd.DataFrame({"State": model_df.State, "Cluster": clusters})
kdf['Cluster Numbers'] = clusters
kdf
| Street | State | City | Weather_Bin | Hour | Severity | Cluster Numbers | |
|---|---|---|---|---|---|---|---|
| 0 | 419 | 3 | 1359 | 0 | 15 | 2 | 2 |
| 1 | 62003 | 3 | 1604 | 1 | 15 | 2 | 3 |
| 2 | 45170 | 8 | 9068 | 4 | 7 | 2 | 0 |
| 3 | 34088 | 41 | 5212 | 1 | 15 | 3 | 0 |
| 4 | 33893 | 41 | 5212 | 1 | 17 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | 31577 | 41 | 3133 | 1 | 17 | 3 | 0 |
| 499996 | 33731 | 8 | 412 | 0 | 17 | 2 | 0 |
| 499997 | 34119 | 33 | 4726 | 1 | 15 | 3 | 0 |
| 499998 | 33561 | 3 | 7011 | 4 | 8 | 2 | 0 |
| 499999 | 86526 | 36 | 7927 | 4 | 2 | 2 | 1 |
500000 rows × 7 columns
SUMMARY¶
- The dataset linked above was reduced to car accidents in 49 states of the USA from 2020 to 2022. Data cleaning was performed, and the analysis began.
- Descriptive statistics provided an overview of the data, and exploratory data analysis methods were used to generate relevant graphs. It was observed that accident counts increased by 49.5% between 2020 and 2022. Most accidents occurred in clear weather, and the three counties with the highest number of accidents were Los Angeles, Miami-Dade, and Orange, respectively.
- The model was evaluated using three different classification models (Logistic Regression, Decision Tree, and Random Forest). The best classification model was determined to be Logistic Regression with an accuracy of 87.2%.
- As an unsupervised learning model, the KMeans clustering algorithm was used, and the data was labeled into 5 clusters.
Created By¶
Ayşegül ÜLKER